Recent analyses have calculated the minimal thermodynamic work required to perform a computation $\pi$
when two conditions hold: the output of $\pi$ is independent of its input (e.g., as in bit erasure), and we use a
physical computer $C$ to implement $\pi$ that is specially tailored to the environment of $C$, i.e., to the precise
distribution over $C$'s inputs, $\mathcal{P}_0$. First I extend these analyses to calculate the work required even if
the output of $\pi$ depends on its input, and even if $C$ is not used with the distribution $\mathcal{P}_0$ it was
tailored for. Next I show that if $C$ will be re-used, then the minimal work to run it depends only on the logical
computation $\pi$, independent of the physical details of $C$. This establishes a formal identity between the
thermodynamics of (re-usable) computers and theoretical computer science. I use this identity to prove that the
minimal work required to compute a bit string $\sigma$ on a “general purpose computer” rather than a special purpose
one, i.e., on a universal Turing machine $U$, is $k_B T \ln(2)$ [Kolmogorov complexity($\sigma$) + log(Bernoulli
measure of the set of strings that compute $\sigma$) + log(halting probability of $U$)]. I also prove that using $C$
with a distribution over environments results in an unavoidable increase in the work required to run the computer,
even if it is tailored to that distribution over environments. I end by using these results to relate the free energy
flux incident on an organism / robot / biosphere to the maximal amount of computation that the organism / robot /
biosphere can do per unit time.


There has been great interest for over a century in the relationship
between thermodynamics and computation [2, 6, …]. A breakthrough was
made with the argument of Landauer that at least $kT \ln[2]$ of
work is required to run a 2-to-1 map like bit-erasure on any
physical system […, 12, 18, 26-28, 33, 35, 42, 45], a conclusion
that is now being confirmed experimentally […, 13, 24, 25, …].
A related conclusion was that a 1-to-2 map can act
as a refrigerator rather than a heater, removing heat from the
environment […]. For example, this occurs in adiabatic
demagnetization of an Ising spin system [26].
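To give a sense of scale for the Landauer bound quoted above, the following minimal sketch evaluates $kT \ln 2$ numerically. The function name `landauer_bound` and the choice of 300 K as a room temperature are illustrative assumptions, not part of the original analysis.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def landauer_bound(temperature_kelvin: float) -> float:
    """Minimal work in joules to run a 2-to-1 map (erase one bit) at temperature T."""
    return K_B * temperature_kelvin * math.log(2)


# Assumed room temperature of 300 K (illustrative choice).
print(landauer_bound(300.0))  # about 2.87e-21 J
```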

This early work leaves many issues unresolved however. In 
particular, say any output can be produced by our map, with 
varying probabilities, from any input. So the map is neither 
a pure heater nor a pure refrigerator. What is the minimal 
required work in this case? 

More recently, there has been dramatic progress in our understanding
of non-equilibrium statistical physics and its relation
to information-processing [7, 9, 10, …, 20, 23, …]. Much of this
recent literature has analyzed the minimal work required to drive a physical
system's (fine-grained, microstate) dynamics during the interval
from $t = 0$ to $t = 1$ in such a way that the dynamics of
the macrostate is controlled by some desired Markov kernel
$\pi$. In particular, there has been detailed analysis of the minimal
work needed when there are only two macrostates, $v = 0$
and $v = 1$, and we require that both get mapped to the bin
$v = 0$ […, 34, …]. By identifying the macrostates $v \in V$
as Information Bearing Degrees of Freedom (IBDF […]) of an
information-processing device like a digital computer, these
analyses can be seen as elaborations of the analyses of Landauer
et al. on the thermodynamics of bit erasure.

Many of the work-minimizing systems considered in this
recent literature proceed in two stages. First, they physically
change an initial, non-equilibrium distribution over microstates
to the equilibrium distribution, $p^{eq}(w)$, in a quenching
process. All information concerning the initial microstate
is lost from the distribution over $w$ by the end of this first stage.
So in particular all information is lost about what the initial
bin $v_0$ was. In addition, the Hamiltonian used in this quench
is defined in terms of $\mathcal{P}_0$, the initial distribution over computer
inputs. There is some unavoidable extra work if the computer
is used with an initial distribution that differs from $\mathcal{P}_0$.

Next, in the second stage $p^{eq}(w)$ is transformed to an ending
(non-equilibrium) distribution over $w$, with an associated
distribution over the ending coarse-grained bin, $v_1$. However,
since all information about $v_0$ has been lost by the beginning
of the second stage, $v_0$ cannot have any effect on the distribution
over $v_1$ produced in the second stage. Accordingly,
changing the distribution over inputs to one of these systems
has no effect on the distribution over outputs. So although
such a system can be used to implement a many-to-one map
over the IBDF (i.e., the bins) in a digital computer, it cannot
be used to implement any computational map whose output
varies with its input.

In this paper I show how to implement any given conditional
distribution $\pi$ with minimal work, even if $\pi$ maps different
initial macrostates $v_0$ to different final macrostates $v_1$.
I do this by connecting the original processor system, with
macrostates $v \in V$, to a separate, initialized “memory system”
that records $v_0$, and then evolving the joint system in such a way
that the processor dynamics effectively samples $\pi(\cdot \mid v_0)$.
After this the memory is re-initialized (i.e., the stored copy of $v_0$
is erased), completing the cycle.

Like the systems considered in the literature, those considered
here are implicitly optimized for some “prior” distribution
over the inputs, $\mathcal{G}_0$. Here I go beyond the analyses in
the literature by allowing the actual distribution over inputs,
$\mathcal{P}_0(v)$, to differ from our assumed distribution, $\mathcal{G}_0$. When
$\mathcal{G}_0 = \mathcal{P}_0$, the dynamics of the joint system is thermodynamically
reversible. So the second law tells us that there is no
alternative system that implements $\pi$ with less work. However,
if $\mathcal{G}_0 \neq \mathcal{P}_0$ (i.e., the computer is used with a different
user from the one it is optimized for) and $\pi$ is not just a
permutation over $V$, some of the work when the memory is
reinitialized is unavoidably wasted. I then analyze the situation
where there is a distribution over $\mathcal{P}_0$ (e.g., as occurs if
the system is a computer that will be used with multiple users,
or if it is an organism that will experience different environments)
and $\mathcal{G}_0$ is optimized for that distribution, deriving how
much extra work is needed due to uncertainty about who the
user is.

I also show that if the physical system used to run the computation
will be re-used, then the “internal entropies”, giving
the entropy internal to each coarse-grained bin, do not contribute
to the minimal work. In such a scenario the specifics of
the physical system implementing the computation (which
are reflected in those internal entropies) are irrelevant. The
work depends only on the computation $\pi$ implemented by that
system. (In previous analyses the computer was not re-used,
so the internal entropies, and therefore physical details of
the computer, were relevant.) This result establishes a formal
identity between the thermodynamics of (re-usable) computers
and theoretical computer science.

As an illustration, I use this identity to analyze the thermodynamics
of a “general purpose computer” rather than a
special purpose one, i.e., of a universal Turing machine $U$,
where the macrostates are labelled by bit strings. In particular
I prove that the work required to compute a particular bit
string $\sigma$ on $U$ is $k_B T \ln(2)$ times the sum of the Kolmogorov
complexity of $\sigma$, the log of the Bernoulli measure of the set of all
strings that compute $\sigma$, and the log of the halting probability for
$U$. Intuitively, by considering all input strings that result in
$\sigma$, the second term quantifies “how many-to-one” $U$ is, something
that is not captured by the Kolmogorov complexity of
$\sigma$.

I end by using these results to relate the free energy flux incident
on an organism (robot, biosphere) to the maximal “rate
of computation” implemented by that organism (resp., robot,
biosphere).

I refer to the engineer who constructs the system as its “designer”,
and refer to the person who chooses its initial state
as its “user”. While the language of computation is used
throughout this paper, the analysis applies to any dynamic process
$\pi$ over a coarse-grained partition of a fine-grained space,
not just those processes conventionally viewed as computers.
So for example, the analysis applies to the dynamics of a biological
organism reacting to its environment, if we coarse-grain
that dynamics; the organism is the “computer”, the dynamics
is the “computation”, the “designer” of the organism is natural
selection, and the “user” initializing the organism is the
environment.

Problem setup. I write $|X|$ for the number of elements $x$ in
any finite space $X$, and write the Shannon entropy of a distribution
$p$ over $X$ as $S_p(X) = S(p) = -\sum_x p(x) \ln[p(x)]$, or
even just $S(X)$ when $p$ is implicit. I use similar notation for
conditional entropy, etc. I also write the cross-entropy between
two distributions $p$ and $q$ both defined over some space
$X$ as $C(p(X) \,\|\, q(X)) = -\sum_x p(x) \ln[q(x)]$, or sometimes just
$C(p \,\|\, q)$ for short [8, 32].
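The two definitions above translate directly into code. The sketch below is illustrative only: the function names and the example distributions `p` and `q` are assumptions introduced here, and natural logarithms are used to match the text.

```python
import math


def entropy(p):
    """Shannon entropy S(p) = -sum_x p(x) ln p(x), skipping zero-probability terms."""
    return -sum(px * math.log(px) for px in p if px > 0)


def cross_entropy(p, q):
    """Cross-entropy C(p || q) = -sum_x p(x) ln q(x); q must be nonzero wherever p is."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)


# Hypothetical example distributions over a three-element space.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(entropy(p))           # S(p)
print(cross_entropy(p, p))  # equals S(p)
print(cross_entropy(p, q))  # exceeds S(p) by the KL divergence from p to q
```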

Let $W$ be the space of all possible microstates of a system
and $\mathcal{V}$ a partition of $W$, i.e., a coarse-graining of it into
macrostates. For example, in a digital computer, $\mathcal{V}$ maps each
microstate of the computer, $w \in W$, into the bit pattern in the
computer's memory. I assume that the set of labels of the partition
elements, $V$, contains “0”. When convenient, I subscript
a partition element with a time that the system state lies in that
element, e.g., writing $v_0$, $v_1$, etc.

The Hamiltonian over $W$ at $t = 0$ is $H^0_{sys}$, with associated
equilibrium (Boltzmann) distribution $p^{eq}$. For simplicity,
I assume that $\forall v \in V$, at the two times $t = 0$ and $t = 1$,
$\Pr(w \mid v)$ is the same distribution, which I write as $q^v_{in}(w)$.
(N.b., $q^v_{in}(w) = 0$ if $\mathcal{V}(w) \neq v$.) As in the analyses of computers
in [2-4, 26], there is a “user” of the system who intervenes
in its dynamics at or before $t = 0$, which results in the
initial macrostate $v_0 \in V$ being set by sampling a user distribution
$\mathcal{P}_0(v)$. As examples, $\mathcal{P}_0$ could model randomness
in how a single user of a computer initializes the computer at
$t = 0$, or randomness in how an environment of an organism
perturbs the organism at $t = 0$. I write the (potentially non-equilibrium)
unconditioned distribution over $W$ at $t = 0$ as
$\mathcal{P}_0(w) = \sum_v \mathcal{P}_0(v) q^v_{in}(w)$.

The evolution of the microstates $w \in W$ during $t \in [0, 1)$
results in a conditional distribution over macrostates, $\pi(v_1 \mid v_0)$.
Since they are set by the designer of the system, I take $\pi$
and the distributions $q^v_{in}$ to be fixed and known to that designer.
However, I allow the designer to be uncertain about what $\mathcal{P}_0$
is. As shorthand, I write $\mathcal{P}_1(v) = \sum_{v'} \mathcal{P}_0(v') \pi(v \mid v')$.
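The shorthand for the output distribution is just a matrix-vector product over macrostates. As a minimal sketch (the names `push_forward`, `erase` and `noisy`, and all numerical values, are hypothetical illustrations), with the kernel stored as `pi[v0][v1]`:

```python
def push_forward(p0, pi):
    """Return P_1(v1) = sum_{v0} P_0(v0) * pi(v1 | v0), with pi[v0][v1] = pi(v1 | v0)."""
    n_out = len(pi[0])
    return [sum(p0[v0] * pi[v0][v1] for v0 in range(len(p0))) for v1 in range(n_out)]


# Bit erasure: both inputs map to bin 0, so the output ignores the input.
erase = [[1.0, 0.0],
         [1.0, 0.0]]

# A noisy kernel whose output distribution does depend on the input.
noisy = [[0.9, 0.1],
         [0.2, 0.8]]

p0 = [0.3, 0.7]  # hypothetical user distribution over the two bins
print(push_forward(p0, erase))
print(push_forward(p0, noisy))
```

For the erasure kernel the result is concentrated on bin 0 for every choice of `p0`, which is the input-independence that the two-stage systems described earlier are limited to.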

I wish to focus on the component of the thermodynamic
work that reflects computation, ignoring the component that
reflects physical labor. This is guaranteed if the expected
value of the Hamiltonian at $t = 0$ and $t = 1$ is the same,
regardless of $\mathcal{P}_0$ and $\mathcal{P}_1$, since that means that the change
in the expected value of the Hamiltonian is zero. Accordingly,
I assume that at both $t = 0$ and $t = 1$, the expected
value of the Hamiltonian if the system is in state $v$ then (i.e.,
$\sum_w q^v_{in}(w) H^0_{sys}(w)$) is a constant independent of $v$. I write that
constant as $h^0_{sys}$. To simplify the analysis below, I also assume
that $E_{p^{eq}}[H^0_{sys}(w)] = h^0_{sys}$.

Overview of the system. The designer's goal is to modify
the system considered in [15, 34, 41] into one which no longer
loses the information of what the initial macrostate $v_0$ was as it
evolves from $t = 0$ to $t = 1$. This can be done by coupling the
system with an error-free memory apparatus, patterned after
the measurement apparatus introduced in [34, 41, 42]. As in
those studies, the “measurement” is a process that copies the
macrostate to an initialized, external, symmetric memory, setting it to
the value of $v_0$, and does so without changing $v_0$ (or even the
initial microstate of the processor, $w_0$). Having set the value
of such a memory, we can use its value later on, to govern the
dynamics of $w$ after the time when the system distribution
has relaxed to $p^{eq}(w)$, to ensure that $v$ evolves according
to $\pi$, even if under $\pi$ the ending macrostate of the system
depends on its initial state. Finally, to complete a cycle, the
memory apparatus must be reinitialized.

FIG. 1. Example of dynamics of the marginal distributions of a system
with a binary coarse-graining, where the bins have the same size
and both $q^v_{in}(w)$ and $Q^m(u)$ are uniform for all $v$, $m$. The top row
shows the dynamics of the processor, and the bottom row shows the
dynamics of the memory. Fig. (a) shows the $t = 0$ state, with the right
bin of the processor more probable than the left bin, and the memory
in its initialized bin. Fig. (d) shows the $t = 1$ state, where the relative
probabilities of the processor bins have changed according to $\pi$,
and the memory has been returned to its initialized bin.

I assume that the memory and system are both always in
contact with a heat bath at temperature $T$. To be able to store a
copy of any $v \in V$, the memory must have the same set of possible
macrostates, $V$. I write the separate memory macrostates
as $m \in V$, with associated microstates $u \in U$. (A priori, $U$
need not have any relation to $W$.) For simplicity I assume
that the conditional distribution of $u$ given any $m$ is the same
distribution $Q^m(u)$ at both $t = 0$ and $t = 1$, and that there
is a uniform equilibrium Hamiltonian $H^0_{mem}(u)$. In addition,
I make the inductive hypothesis that the starting value of the
memory is $m = 0$, with probability 1. The system dynamics
comprises the following four steps (see Fig. 1):

I. First the memory apparatus copies the initial value $v = v_0$
into the memory, i.e., sets $m = v_0$. This step is done without
any change to $w$, and so $\mathcal{P}_0(w)$ is unchanged. Since the copy
is error-free and the memory is symmetric, this step does not
require thermodynamic work [34].

II. Next a Quench-then-Relax procedure (QR) like the one
described in [15, 34] is run on the distribution over $w$, $q^{v_0}_{in}(w)$.
In such a QR, first we replace $H^0_{sys}$ with a quenching Hamiltonian
chosen such that $q^{v_0}_{in}$ is an equilibrium distribution for
a Hamiltonian specified by the memory system macrostate:

$$H^{v_0}_{in}(w) = -kT \ln[q^{v_0}_{in}(w)] \tag{1}$$

(While $w$ is unchanged in this adiabatic quench, and therefore
so is the distribution over $W$, in general work is required if
$H^{v_0}_{in} \neq H^0_{sys}$.) Next we isothermally and quasi-statically relax
$H^{v_0}_{in}$ back to $H^0_{sys}$, thereby changing $q^{v_0}_{in}(w)$ to $p^{eq}(w)$. (See
also […].)

III. Next we use the fact that $m = v_0$ to run a QR over $W$ in
reverse, with the quenching Hamiltonian

$$H^{v_0}_{out}(w) = -kT \ln[q^{v_0}_{out}(w)] \tag{2}$$

where $q^{v_0}_{out}(w) = \sum_{v_1} q^{v_1}_{in}(w) \pi(v_1 \mid v_0)$. This reverse QR begins
by isothermally and quasi-statically sending $H^0_{sys}$ to $H^{v_0}_{out}$. After
that $H^{v_0}_{out}$ is replaced by $H^0_{sys}$ in a “reverse quench”, with no
change to $w$. As in step (II), there is no change to $m$ in step
(III).

IV. Finally, as described in detail below, we reset $m$ to 0.
This ensures we can rerun the system, and also guarantees the
inductive hypothesis.

Since the system samples $\pi(v_1 \mid v_0)$ in step (III), these four
steps implement the map $\pi$ even if $\pi$'s output depends on its
input, and no matter what $\mathcal{P}_0$ is. (The whole reason for storing
$v_0$ in $m$ was to allow this step III.)

Moreover, the expected work expended in the first three
steps is given (with abuse of notation) by the conditional entropy,

$$-kT \left[ S_\pi(V_1 \mid V_0) + \sum_v S(q^v_{in}) \left( \mathcal{P}_1(v) - \mathcal{P}_0(v) \right) \right] \tag{3}$$

(See Supplementary Material (SM) for proof.)

Resetting the memory. We implement step (IV) by first running
a QR on the distribution over $U$ (not $W$), and then running
a reverse QR, one that ends with $m = 0$ no matter what
the initial value of $m$ was.

In detail, suppose that the designer of the system guesses
that the distribution over the initial values of the macrostates
is $\mathcal{G}_0(v_0)$, which in general need not equal $\mathcal{P}_0$. This distribution
would be the prior probability over the values of $m$ if $\mathcal{G}_0$
equalled $\mathcal{P}_0$, since $m$ is a copy of $v_0$. The associated likelihood
of $v_1$ given $m$ is $\mathcal{G}(v_1 \mid m) = \pi(v_1 \mid v_0 = m)$. So the posterior
probability of $m$ given $v_1$, $\mathcal{G}(m \mid v_1)$, is proportional to
$\mathcal{G}_0(m) \pi(v_1 \mid m)$. This gives the (guessed) posterior probability
over memory microstates, which we can write as

$$\mathcal{G}(u \mid v_1) = \sum_m \mathcal{G}(m \mid v_1) Q^m(u) \tag{4}$$

with some abuse of notation. In contrast, the actual posterior
distribution $\mathcal{P}(m \mid v_1)$ is given by the actual prior $\mathcal{P}_0$, and gives
a posterior distribution

$$\mathcal{P}(u \mid v_1) = \sum_m \mathcal{P}(m \mid v_1) Q^m(u) \tag{5}$$

The premise of this paper is that to reset the memory the
computer first runs a QR using the quenching Hamiltonian

$$H^{v_1}_{mem}(u) = -kT \ln[\mathcal{G}(u \mid v_1)] \tag{6}$$

to drive the distribution over $u$ to $p^{eq}_{mem}(u)$, since this would
relax the memory using minimal work if the guessed prior $\mathcal{G}_0$
equaled the actual one, $\mathcal{P}_0$. (Intuitively, $v_1$ is a “noisy measurement”
of $m$ that is used to set this quenching Hamiltonian,
and we are running the same process as in step II, just
with the roles of the memory and processor reversed.) Next
we run a reverse QR, taking $p^{eq}_{mem}(u)$, the uniform distribution
over all $U$, to the distribution that is uniform over $m = 0$,
zero elsewhere. This completes the resetting of the memory
macrostate.

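Averaging the work required in this resetting of the memory,
and adding it to the expression in Eq. (3), gives the minimal
expected work for running $\pi$:

$$\Omega_{\mathcal{G}_0, \mathcal{P}_0} = -kT \left( \sum_{v_0, v_1} \mathcal{P}(v_0, v_1) \ln[\mathcal{G}(v_0 \mid v_1)] + S_{\mathcal{P}}(V_1 \mid V_0) + \sum_v S(q^v_{in}) \left( \mathcal{P}_1(v) - \mathcal{P}_0(v) \right) \right) \tag{7}$$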
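The expression in Eq. (7) can be evaluated numerically for small examples. The sketch below is only an illustration under stated assumptions: work is reported in units of $kT$, the kernel is stored as `pi[v0][v1]`, the internal entropies $S(q^v_{in})$ are passed in as a list `s_in`, and the guessed prior is assumed to be nonzero wherever the actual one is. The function name and the example numbers are hypothetical.

```python
import math


def expected_work(p0, g0, pi, s_in):
    """Evaluate Eq. (7) in units of kT for actual prior p0 and guessed prior g0."""
    n = len(p0)
    p1 = [sum(p0[a] * pi[a][b] for a in range(n)) for b in range(n)]
    g1 = [sum(g0[a] * pi[a][b] for a in range(n)) for b in range(n)]
    reset_term = 0.0   # sum_{v0,v1} P(v0,v1) ln G(v0|v1)
    cond_entropy = 0.0  # S_P(V1|V0)
    for a in range(n):
        for b in range(n):
            joint = p0[a] * pi[a][b]
            if joint == 0.0:
                continue
            guessed_posterior = g0[a] * pi[a][b] / g1[b]  # Bayes' theorem with prior g0
            reset_term += joint * math.log(guessed_posterior)
            cond_entropy -= joint * math.log(pi[a][b])
    internal = sum(s_in[v] * (p1[v] - p0[v]) for v in range(n))
    return -(reset_term + cond_entropy + internal)


erase = [[1.0, 0.0], [1.0, 0.0]]  # bit erasure
s_in = [0.0, 0.0]                 # assume equal internal entropies, set to zero
matched = expected_work([0.5, 0.5], [0.5, 0.5], erase, s_in)
mismatched = expected_work([0.5, 0.5], [0.9, 0.1], erase, s_in)
print(matched)     # ln 2, the Landauer value in units of kT
print(mismatched)  # larger, because the guessed prior differs from the actual one
```

With a matched prior the sketch recovers $\ln 2$ for bit erasure, and a mismatched prior raises the cost, in line with the discussion of wasted work in the reset step.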


(See the SM for a proof of Eq. (7).) Since $\mathcal{G}(v_1 \mid v_0) = \pi(v_1 \mid v_0) = \mathcal{P}(v_1 \mid v_0)$,
we can use Bayes' theorem to rewrite Eq. (7) as

$$kT \left( C[\mathcal{P}(V_0) \,\|\, \mathcal{G}(V_0)] - C[\mathcal{P}(V_1) \,\|\, \mathcal{G}(V_1)] - \sum_v S(q^v_{in}) \left( \mathcal{P}_1(v) - \mathcal{P}_0(v) \right) \right) \tag{8}$$

(assuming all distributions over $V$ have the same support, so
that we don't divide by zero when using Bayes' theorem).

So if $\mathcal{G}_0 = \mathcal{P}_0$, or alternatively $\pi$ is an invertible function
over $V$, $\Omega_{\mathcal{G}_0, \mathcal{P}_0} = kT[S_{\mathcal{P}_0}(W) - S_{\mathcal{P}_1}(W)]$. This quantity is
sometimes called “generalized Landauer cost”. Note that for
a fixed $\mathcal{P}(w_0, w_1)$, it is independent of the partition.

Multiple cycles of a computer. Sometimes we will want
to use an (IID) calculator computer, in which we IID sample
$\mathcal{P}_0$ at the end of each iteration, over-writing $v_1$, before
running $\pi$ again. In such calculators, after step (IV) above,
the value $v_1$ is copied to an external system via an additional
memory apparatus (e.g., in order to drive some physical actuator).
Then a different external system (e.g., a sensor) forms a
sample $v'_0 \sim \mathcal{P}_0$, and $v_1$ gets replaced by $v'_0$. Only after these
two new steps have we completed a full cycle. At this point
we can run another cycle, to apply $\pi$ again, but starting from
$v'_0$ rather than $v_1$.

In the SM it is shown that for an “extended” calculator computer,
where $\pi$ is iterated $N$ times and only then is $v$ copied to
an external system and $v'_0$ copied in, the total work expended
is at least

$$kT \left( C[\mathcal{P}(V_0) \,\|\, \mathcal{G}(V_0)] - C[\mathcal{P}(V_N) \,\|\, \mathcal{G}(V_N)] \right) \tag{9}$$

Note that the expected work of a calculator has no dependence
on the values $S(q^v_{in})$; in calculators the work depends
only on the logical map that $\pi$ implements over $V$, independent
of the physical system that implements that map. So there is
a formal identity between the thermodynamics of (calculator)
computers and computer science.

As an example, fix a prefix-free universal Turing machine
$U$ […]. Identify the macrovariable $v \in V$ of the physical
system implementing $U$ with the instantaneous description
(ID) of $U$, so that $\pi$ gives the dynamics over those IDs, i.e., it
is the dynamical law implementing the Turing machine. I will
say that an ID of $U$ is a starting
ID if it specifies that the machine $U$ is in its initial state with
an input string for which $U$ halts. Also define $I^\sigma$ as the set
of all starting IDs that halt with output $\sigma$, and $K_U(\sigma)$ as the
Kolmogorov complexity of $\sigma$. Finally, for any starting ID $a$,
define $\ell(a)$ as the length of the input string of $a$.

Perhaps the most natural prior probability distribution over
IDs is the (normalized) “universal prior probability”, i.e.,
$\mathcal{G}_0(v) = 2^{-\ell(v)} / \Lambda$ if $v$ is a starting ID and $\mathcal{G}_0(v) = 0$ otherwise,
where $\Lambda$ is $U$'s halting probability. It is shown in the SM
that under the simpler of two natural definitions of how to use
a TM $U$ as a “calculator computer”, the minimal work (over
all possible $\mathcal{P}_0$) needed to compute $\sigma$ is

$$kT \ln(2) \left( K_U(\sigma) + \log[\mathcal{G}_0(I^\sigma)] + \log \Lambda \right) \tag{10}$$

So the greater the gap between the log-probability that a
randomly chosen program computes $\sigma$ and the log-probability
of the most likely such program, the greater the work to compute
$\sigma$. Intuitively, running $U$ on $I^\sigma$ executes a many-to-one
map in the Landauer sense, taking many starting IDs
to the same ending ID. The gap between $\log[\mathcal{G}_0(I^\sigma)]$ and
$\max_{v_0 \in I^\sigma} \log[\mathcal{G}_0(v_0)] = -(K_U(\sigma) + \log \Lambda)$ quantifies “how many-to-one”
that map is. (Similar results hold for other choices of
space of logical variables $V$, machine $\pi$ and / or prior $\mathcal{G}_0$.)

As an aside, by Levin's coding theorem […], $K_U(\sigma) + \log[\mathcal{G}_0(I^\sigma)]$
is bounded by a constant that depends only on
$U$, and is independent of $\sigma$. So for any $U$, there is a
$\sigma$-independent upper bound on the minimal amount of work
needed for $U$ to compute $\sigma$.
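The three terms in Eq. (10) can be made concrete with a toy stand-in for $U$. The sketch below is an assumption-laden illustration, not a universal Turing machine: `toy_programs` is a hypothetical finite, prefix-free table from input strings to outputs, and logarithms are base 2 so that the result is in units of $kT \ln 2$.

```python
import math

# Hypothetical toy "machine": a finite prefix-free table of halting programs.
toy_programs = {"0": "a", "10": "a", "110": "b"}


def work_in_units_of_kT_ln2(sigma, programs):
    """K(sigma) + log2 G0(I^sigma) + log2 Lambda for the toy table."""
    halting_probability = sum(2.0 ** -len(p) for p in programs)           # Lambda
    lengths = [len(p) for p, out in programs.items() if out == sigma]
    complexity = min(lengths)                                              # shortest program for sigma
    prior_mass = sum(2.0 ** -n for n in lengths) / halting_probability    # G0(I^sigma)
    return complexity + math.log2(prior_mass) + math.log2(halting_probability)


print(work_in_units_of_kT_ln2("a", toy_programs))  # positive: two programs map to "a"
print(work_in_units_of_kT_ln2("b", toy_programs))  # zero: only one program maps to "b"
```

In this toy table the output "b" has a single program, so the map onto it is one-to-one and the work vanishes, while "a" is reached from two programs and costs $1 + \log_2(3/4)$ in units of $kT \ln 2$.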

Multiple users. Often rather than a single user of a calculator
computer there will be a distribution over users, $\Pr(\mathcal{P}_0)$.
To analyze this situation, use Eq. (9) to write

$$\langle \Omega_{\mathcal{G}_0, \mathcal{P}_0} \rangle = \Omega_{\mathcal{G}_0, \langle \mathcal{P}_0 \rangle} \tag{11}$$

(where $\langle \cdot \rangle$ indicates an average according to $\Pr(\cdot)$). Applying
this equality to Eq. (9), and using the facts that Kullback-Leibler
(KL) divergence is non-increasing in $t$ and is minimized
(at zero) when its arguments are equal [8], we see that
the $\mathcal{G}_0$ that minimizes expected work is $\langle \mathcal{P}_0 \rangle$. The associated
expected work is $S_{\langle \mathcal{P} \rangle}(V_0) - S_{\langle \mathcal{P} \rangle}(V_1)$.

The expected work would instead be $\langle S_{\mathcal{P}}(V_0) - S_{\mathcal{P}}(V_1) \rangle$
if we could somehow re-optimize $\mathcal{G}_0$ for each $\mathcal{P}_0$. So the difference
between those two values of expected work can be
viewed as the minimal penalty we must pay due to uncertainty
about who the user is. This penalty can be re-expressed as the
drop from $t = 0$ to $t = 1$ in the entropic variance,

$$\langle \mathcal{P} \ln[\mathcal{P}] \rangle - \langle \mathcal{P} \rangle \ln[\langle \mathcal{P} \rangle] \tag{12}$$

i.e., it is the growth from $t = 0$ to $t = 1$ in certainty about $\mathcal{P}$.
Entropic variance is non-negative and non-increasing [footnote 1]. So
the work penalty that arises due to growth in certainty about $\mathcal{P}$
is always non-negative. This is true even if the minimal work
required to implement the underlying computation is negative.
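The penalty for user uncertainty can be checked on a small example. In the sketch below, work is in units of $kT$, and the function names, the two hypothetical users and their weights are illustrative assumptions. It compares the work when the prior is set to the average user distribution with the average work when the prior could be re-optimized per user.

```python
import math


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability terms."""
    return -sum(px * math.log(px) for px in p if px > 0)


def push_forward(p0, pi):
    """Output distribution under the kernel pi[v0][v1] = pi(v1 | v0)."""
    return [sum(p0[a] * pi[a][b] for a in range(len(p0))) for b in range(len(pi[0]))]


def user_uncertainty_penalty(users, weights, pi):
    """Work with the prior set to the mean user, minus mean work with per-user priors."""
    n = len(users[0])
    mean0 = [sum(w * p[v] for w, p in zip(weights, users)) for v in range(n)]
    shared = entropy(mean0) - entropy(push_forward(mean0, pi))
    tailored = sum(w * (entropy(p) - entropy(push_forward(p, pi)))
                   for w, p in zip(weights, users))
    return shared - tailored


users = [[0.9, 0.1], [0.1, 0.9]]  # two hypothetical user distributions
weights = [0.5, 0.5]              # assumed equally likely
erase = [[1.0, 0.0], [1.0, 0.0]]
flip = [[0.0, 1.0], [1.0, 0.0]]
print(user_uncertainty_penalty(users, weights, erase))  # strictly positive
print(user_uncertainty_penalty(users, weights, flip))   # zero up to rounding
```

For the invertible map the entropic variance is unchanged between the two times, so the penalty vanishes, while erasure destroys all information about which user was present and incurs a positive penalty.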

Implications for biology. Any work expended on the processor
must first be acquired as free energy from the processor's
environment. However, in many situations there is a limit
on the flux of free energy through a processor's immediate
environment. Combined with the analysis above, such limits
provide upper bounds on the “rate of (potentially noisy) computation”
that can be achieved by a biological organism in that
environment. In particular, since the minimal work required
to do a computation increases if $\mathcal{G}_0 \neq \mathcal{P}_0$, using the same biological
organism in a new environment, differing from the one
it is tailored for, will in general result in extra required work.

As an example, these results bound the rate of computation
of a human brain. Given the fitness cost of such computation
(the brain uses ~20% of the calories used by the human
body), this bound contributes to the natural selective pressures
on humans, in the limit that operational inefficiencies of the
brain have already been minimized. In other words, these
bounds suggest that natural selection imposes a tradeoff between
the fitness quality of a brain's decisions, and how much
computation is required to make those decisions. In this regard,
it is interesting to note that the brain is famously noisy,
and as discussed above, noise in computation reduces the
total thermodynamic work required.
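As a rough illustration of how a free energy flux limits a rate of computation, the sketch below divides a power budget by the Landauer cost of one bit erasure. This is a simplification of the bounds discussed above: it treats bit erasure at $kT \ln 2$ as the unit of computation, and the 20 W power budget and 310 K temperature are assumed, illustrative values rather than figures taken from the text.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def max_bit_erasures_per_second(power_watts: float, temperature_kelvin: float) -> float:
    """Upper bound on bit erasures per second if all power pays the Landauer cost."""
    return power_watts / (K_B * temperature_kelvin * math.log(2))


# Assumed values: a 20 W power budget at 310 K (hypothetical brain-like figures).
rate = max_bit_erasures_per_second(20.0, 310.0)
print(f"{rate:.2e} bit erasures per second")
```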

As a second example, the rate of solar free energy incident
upon the earth provides an upper bound on the rate of computation
that can be achieved by the biosphere. (This bound
holds for any choice for the partition of the biosphere's fine-grained
space into macrostates such that the dynamics over
those macrostates executes $\pi$.) In particular it provides an upper
bound on the rate of computation that can be achieved by
human civilization, if we remain on the surface of the earth,
and only use sunlight to power our computation.

1 Resp., since entropy is a concave function of distributions, and since entropic
variance is the average (over $\mathcal{P}_0$'s) of the KL divergence between $\mathcal{P}$ and $\langle \mathcal{P} \rangle$.

DERIVATION OF EQ. 3 OF MAIN TEXT


In this section I evaluate the expected work required to implement
the first three steps of the system for initial distribution
$\mathcal{P}_0$. To do this, it will be convenient to calculate the expected
work to perform those steps conditioned on a particular
$v_0$, and then average over all $v_0$ according to $\mathcal{P}_0(v_0)$.

As in [34], I assume that after step (I) the interaction Hamiltonian
between $W$ and $U$ is negligible. Also as in that work,
I assume that the quench step at the beginning of step II is an
instantaneous change to the energy of every $w$, $\Delta E(w)$. This
process does not actually change $w$ (such changes are associated
with transfer of heat). Since the quenching Hamiltonian
depends on the value of $m$ (which due to step (I) depends on
$v_0$), the value of $\Delta E(w)$ for each $w$ also depends on $m$. That
change in the energy of $w$ is identified as the work done on
the system in the quench step when it starts (and stays) in that
state $w$.

Now due to the fact that step (I) did not change $w$, at the
beginning of the quench step the posterior probability of $w$
given a current value $m$ is $q^m_{in}(w)$. Therefore the expected work
done in this quench step conditioned on a particular value $m$ is
$\sum_w q^m_{in}(w)[H^m_{in}(w) - H^0_{sys}(w)]$. As shorthand, define $S^0_{sys}$ as the
Shannon entropy over $W$ for the Boltzmann distribution with
temperature $T$ and Hamiltonian $H^0_{sys}$. Then conditioned on a
value $v_0$ at the beginning of step (I), the work to perform the
entire QR in step (II) is

$$\sum_w q^m_{in}(w)\left[H^m_{in}(w) - H^0_{sys}(w)\right] + \mathcal{F}(H^0_{sys}) - \mathcal{F}(H^m_{in}) \tag{13}$$

where $m = v_0$ and

$$\mathcal{F}(H^0_{sys}) = h^0_{sys} - kT S^0_{sys} \tag{14}$$

is the equilibrium free energy of $H^0_{sys}$ at temperature $T$.

By definition of $H^m_{in}$, $\mathcal{F}(H^m_{in}) = 0$. So the expression
in Eq. (13) just equals $kT[S(q^m_{in}) - S^0_{sys}]$. Note that this amount
of work is negative, since work is extracted by sending $q^m_{in}$ to
the equilibrium distribution for $H^0_{sys}$.

Similarly, to implement step (III) requires work of at least
$kT[S^0_{sys} - S(q^{v_0}_{out})]$. Now for any distribution $\Pr(w)$, with
some abuse of notation, we can write $S_{\Pr}(v \mid w) = 0$, since
$w$ sets $v$ uniquely. Therefore

$$S_{\Pr}(w) = S_{\Pr}(v \mid w) + S_{\Pr}(w) = S_{\Pr}(v, w) = S_{\Pr}(v) + S_{\Pr}(w \mid v) \tag{15}$$

So if we write the Shannon entropy of the distribution over
values $v_1$ conditioned on a particular value of $v_0$ as

$$S_\pi(V_1 \mid v_0) = -\sum_{v_1} \pi(v_1 \mid v_0) \ln[\pi(v_1 \mid v_0)] \tag{16}$$

then we can write

$$S(q^{v_0}_{out}) = S_\pi(V_1 \mid v_0) - \sum_{w_1, v_1} \pi(v_1 \mid v_0)\, q^{v_1}_{in}(w_1) \ln(q^{v_1}_{in}(w_1)) = S_\pi(V_1 \mid v_0) + \sum_{v_1} \pi(v_1 \mid v_0)\, S(q^{v_1}_{in}) \tag{17}$$

Accordingly, the total amount of work in the first three
steps, conditioned on a value $v_0$, is

$$kT\left[S(q^{v_0}_{in}) - S(q^{v_0}_{out})\right] = kT\left[S(q^{v_0}_{in}) - S_\pi(V_1 \mid v_0) - \sum_{v_1} \pi(v_1 \mid v_0)\, S(q^{v_1}_{in})\right] \tag{18}$$

Combining and averaging under $\mathcal{P}_0(v_0)$, the expected work
required to complete the first three steps is

$$-kT\left[S_\pi(V_1 \mid V_0) + \sum_v S(q^v_{in})\left(\mathcal{P}_1(v) - \mathcal{P}_0(v)\right)\right] \tag{19}$$

(The analogous expression in much of the literature has
$S(V_1)$ instead of $S_\pi(V_1 \mid V_0)$; the difference is due to the
requirement that $\pi$ govern the coarse-grained dynamics even
if its output depends on its input, a requirement that means
that we must measure the value $v_0$.)
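The decomposition in Eq. (17) relies on the bins having disjoint supports, and can be verified numerically. The sketch below is illustrative: the five-microstate space, the two bin distributions in `q_in` and the kernel row `pi_row` (standing for $\pi(\cdot \mid v_0)$ at one fixed $v_0$) are hypothetical values chosen for the check.

```python
import math


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability terms."""
    return -sum(px * math.log(px) for px in p if px > 0)


# Hypothetical within-bin distributions over five microstates with disjoint supports.
q_in = {
    0: [0.5, 0.5, 0.0, 0.0, 0.0],  # bin 0 occupies microstates 0 and 1
    1: [0.0, 0.0, 0.2, 0.3, 0.5],  # bin 1 occupies microstates 2, 3 and 4
}
pi_row = {0: 0.4, 1: 0.6}  # assumed pi(v1 | v0) for one fixed v0

# q_out(w) = sum_{v1} q_in^{v1}(w) * pi(v1 | v0), as in the definition below Eq. (2).
q_out = [sum(pi_row[v] * q_in[v][w] for v in pi_row) for w in range(5)]

lhs = entropy(q_out)
rhs = entropy(list(pi_row.values())) + sum(pi_row[v] * entropy(q_in[v]) for v in pi_row)
print(lhs, rhs)  # the two sides of Eq. (17) agree up to rounding
```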


2 In steps II and III the usual convention was followed by quasi-statically send¬ 
ing If? to Hf ys and then sending Hf ys to H'° uv The same total work would 
arise if we instead quasi-statically send H™ to H'° a directly. 
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DERIVATION OF EQ. 7 OF THE MAIN TEXT 


DERIVATION OF EQ. 9 OF MAIN TEXT 


The QR in resetting the memory is run at t = 1, using 
Hm em (u). It does not change w i, just as measurement of vo 
did not change wq. Accordingly, the minimal amount of work 
in this QR is 

2 V(u I v\)[HZJ.u) - K em m + T(Hl em ) 

U 

= kr(-J>(H|vt)ln 

u 

This is true whether or not % = Pq. Note though that 
due to the fact that is defined in terms of | vq) not 
Pq(u | Vi), if both % ^ 'Po an d n is not an invertible determin¬ 
istic map, then the actual posterior P(u | vi) is not the equilib¬ 
rium distribution for This means that immediately after 

the quenching process, as the Hamiltonian over U begins to 
quasi-statically relax, the distribution over U will first settle, 
in a thermodynamically irreversible process, to the equilib¬ 
rium distribution for . No work is involved in that irre¬ 
versible process. However if we had instead chosen % = Pq, 
the expression in Eq. ([20} would have been less, i.e., less work 
would have been required, since no such irreversible process 
would have occurred. 

To complete resetting the memory we now run a reverse 
QR that takes u from the uniform distribution over all U to 
the distribution Q°(u ), whose support is restricted to u’s such 
that m = 0. This means that for the given value of vi, the 
total work required to reset m to 0, including the contribution 
evaluated in Eq. (|20]), is 

-kr( J>(« In)In 

u 

Multiply and divide the argument of the logarithm in the sum¬ 
mand by P(u | vi). Next use the same kind of decomposition 
as in Eq. and then use the chain-rule for KL divergence. 
This transforms our expression into 


%(u | Vl) 


5 ( 6 °)) 


( 21 ) 


Sf(K I V!) 


-In | V| I 


( 20 ) 


Since no work is required in the new step where we mea¬ 
sure vi, the total work in an iteration is given by adding Eq. ([8]) 
to the additional average work required to map v = vq to 
v = Vq. Since both the values vi and v^ exist outside of W, 
they can be used to specify the two quenching Hamiltonians 
that implement this map. So the additional average work is 


kTZ v y 


S(.q v JPi(y) - SWWoiv') 


. Generalizing this rea¬ 


soning gives 


kr{cmv. o) II nv 0 )\ - CVP{V N ) II ^(V w )]) (25) 

as claimed. 

Note that this result requires the computer to contain an 
integer-valued clock, whose state t increases by 1 at each 
iteration. This clock is needed so that the appropriate pos¬ 
terior &(m t | v r+ i) can be used to set H^*' m (u t ) at iteration 
t. Note that such a clock can be implemented without any 
work, since its dynamics is logically reversible. Given such 
a clock, the cross-entropies and internal entropies over itera¬ 
tions t e 2,..., N - 1 cancel out. 


- 2 nv o | vq) In [%(v 0 | V,)] - 2 n >'0 I Vi )S(£ V °) + S(fi°)) 

Vo Vo 

( 22 ) 


Averaging this according to ^’i(vq) gives 


1Z 


kT >,nv 0 ,vi)hi 


%(vo I vq) 


-J]nvo)5(e vo ) + 5((2 0 )) 


(23) 


Note though that we assumed that the states of the memory
are symmetric. (This is why there is no expected work in step
(I).) So $S(Q^{v})$ is independent of $v$, the two entropy terms cancel, and Eq. (23) reduces to

$$-kT \sum_{v_0, v_1} P(v_0, v_1) \ln[\mathcal{G}(v_0 \mid v_1)] \tag{24}$$


Adding Eq. (24) to Eq. (3) of the main text gives Eq. (7) of the
main text, as claimed.
DERIVATION OF EQ. 10 OF MAIN TEXT

We are ultimately interested in the map from $U$'s input tape
to its output tape. In addition, $U$ is a prefix machine, i.e.,
its read tape head cannot move to the left. This motivates
defining the instantaneous descriptions (IDs) of $U$ as all tuples of
{machine state, contents of output tape, contents of work tape(s),
and contents of input tape at or to the right of the input tape
read head up to the end of the prefix codeword on the input tape}.

In addition, I require that $U$ can only halt if it has reinitialized
its work tape(s). So all IDs with $U$ in its halt state and
output tape containing $\sigma$ have the same (blank) work tape(s).
This means that when $U$ halts there is no “relic” recorded in
the work tape(s) of what the original contents of the input tape
were. In addition, by the precise definition above of IDs, all
IDs with $U$ in its halt state and output tape containing $\sigma$ have
no information concerning the contents of the input tape. So
there is a unique ID with $U$ in its halt state and output tape
containing $\sigma$; it does not matter what input string to $U$ was
used to compute $\sigma$.

To simplify notation, let $f$ be the transition function of $U$,
i.e., write $\pi(v' \mid v) = \delta_{v', f(v)}$. Iterating $f$ from a starting ID $v_0$
eventually results in an ID $v$ that specifies that $U$ has halted.
That $v$ is a fixed point of $f$. Write $\phi(v_0)$ for that fixed point
arising from the ID $v_0$ (i.e., $\phi$ is the partial function computed
by $U$). Also write $N(v_0)$ for the iteration at which $U$ halts
(with output value $\phi(v_0)$). Finally, define $I^\sigma$ as the set of starting
IDs that compute $\sigma$. Since we are interested in user distributions
$P_0$ that are guaranteed to compute $\sigma$, from now on I
restrict attention to $P_0$ whose support lies within $I^\sigma$.
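
To make these definitions concrete, here is a minimal sketch, assuming a hypothetical toy transition function rather than a real universal Turing machine. IDs are modeled as pairs (remaining steps, output), `countdown` plays the role of $f$, `run_to_halt` returns the pair $(\phi(v_0), N(v_0))$, and `starting_ids` collects a toy version of $I^\sigma$. All of these names are illustrative assumptions.

```python
def countdown(v):
    """Hypothetical toy transition function f on IDs (remaining steps, output).

    An ID with remaining == 0 is a halt state, i.e., a fixed point of f.
    """
    remaining, out = v
    if remaining == 0:
        return v
    return (remaining - 1, out + 1)

def run_to_halt(f, v0, max_steps=10_000):
    """Iterate f from v0 until a fixed point is reached.

    Returns (phi(v0), N(v0)): the halt ID and the number of iterations needed.
    """
    v = v0
    for n in range(max_steps + 1):
        nxt = f(v)
        if nxt == v:
            return v, n
        v = nxt
    raise RuntimeError("no fixed point reached within max_steps")

def starting_ids(f, candidates, halt_id):
    """Toy version of I^sigma: candidate IDs whose iteration ends at halt_id."""
    return [v for v in candidates if run_to_halt(f, v)[0] == halt_id]

candidates = [(k, out) for k in range(4) for out in range(4)]
I_sigma = starting_ids(countdown, candidates, (0, 3))
```

In this toy, several distinct starting IDs reach the same halt ID after different numbers of iterations, which is the situation the two definitions of expected work below treat differently.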

There are several ways to define “the expected work for $U$
to compute $\sigma$” using a calculator computer. In two of the most
natural approaches, for any specific $v_0 \in I^\sigma$, the computer is
run some number of iterations, after which $v$ gets copied to the
actuator and then reset, and the total amount of work is tallied.
Where these approaches differ is in their rules for “when $v$ gets
copied to the actuator and reset”.

In one approach we start with some specified $v_0 \in I^\sigma$ and
then

1. Run the computer until it halts (with output $\sigma$) at
timestep $N(v_0)$;

2. Copy that ending $v$ (which is just $\sigma$) to the actuator;

3. Set $v$ to its next value, $v_0'$, copied over from the sensor;

4. Cease to exist.

In this approach, an iteration of the calculator is identified as
the sequence of iterations of $f$ that takes $v_0$ to a halt state. So
different $v_0$ will be identified with different numbers of
iterations of $f$.

A second approach is the same as this first approach, except
that we replace step (1) in this list with iterating $f$ starting
from the specified $v_0 \in I^\sigma$ a total of $\tau$ times. We then
consider the limit as $\tau \to \infty$. Since $v_0$ is a starting state, we are
guaranteed that under this limit, when we reach step (4) the
computer has halted, and the value $\sigma$ has been copied to the
actuator. Moreover, as shown below, the minimal amount of
work expended converges under this limit.

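To evaluate the expected work in the first approach, combine
the fact that $\pi$ is a deterministic function, the cross-entropy
expression for the total work over a run of iterations, and the
restriction on the support of $P_0$ to write the expected work, in units of $kT$, as

$$
\begin{aligned}
& -\sum_{v \in I^\sigma} P_0(v) \ln[\mathcal{G}_0(v)] + \sum_{v \in I^\sigma} P_{N(v)}(\phi(v)) \ln[\mathcal{G}_{N(v)}(\phi(v))] \\
&\quad = -\sum_{v \in I^\sigma} P_0(v) \ln[\mathcal{G}_0(v)] + \sum_{v \in I^\sigma} P_0(v) \ln[\mathcal{G}_{N(v)}(\phi(v))] \\
&\quad = -\sum_{v \in I^\sigma} P_0(v) \ln[\mathcal{G}_0(v)] + \sum_{v \in I^\sigma} P_0(v) \ln\Big[ \sum_{v' : f^{N(v)}(v') = \phi(v)} \mathcal{G}_0(v') \Big] \\
&\quad = \sum_{v \in I^\sigma} P_0(v) \ln\left[ \frac{\sum_{v' : f^{N(v)}(v') = \phi(v)} \mathcal{G}_0(v')}{\mathcal{G}_0(v)} \right]
\end{aligned}
\tag{26}
$$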


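So the optimal $P_0$ is a delta function about the $v \in I^\sigma$ that
minimizes

$$
\frac{\sum_{v' : f^{N(v)}(v') = \phi(v)} \mathcal{G}_0(v')}{\mathcal{G}_0(v)}
= \frac{\sum_{v' : f^{N(v)}(v') = \phi(v)} 2^{-\ell(v')}}{2^{-\ell(v)}}
\tag{27}
$$

and the associated minimal amount of work is

$$
\begin{aligned}
& kT \ln(2) \min_{v \in I^\sigma} \Big[ \ell(v) + \log\big( \mathcal{G}_0(\{v' : f^{N(v)}(v') = \phi(v)\}) \big) + \log \Omega \Big] \\
&\quad = kT \ln(2) \min_{v \in I^\sigma} \Big[ \ell(v) + \log\Big( \sum_{v' : f^{N(v)}(v') = \phi(v)} 2^{-\ell(v')} \Big) \Big]
\end{aligned}
\tag{28}
$$

where $\ell(v)$ is the length of $v$, $\log$ is the base-2 logarithm, and $\Omega$ is
the normalization constant of $\mathcal{G}_0$, which is Chaitin's omega,
i.e., the halting probability for $U$.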

Intuitively, in this first approach, the amount of work for
computing $\sigma$ from some $v \in I^\sigma$ is given by the difference of
two terms. The first is the length of $v$, i.e., how unlikely $v$ is
under $\mathcal{G}_0$. Or to put it another way, it is “how much information”
there is in the initial ID of $U$. The second term is “how
much information” there is concerning the initial ID of $U$ by
the time the computation ends. The
bigger the drop in the amount of information concerning the
initial ID, the more work is required to compute $\sigma$ from $v$.
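
The minimization in Eq. (28) can be illustrated on a small table of programs. The following is only a sketch under stated assumptions: `PROGRAMS` is a made-up prefix-free set of halting input strings with invented outputs and halting times, and the condition $f^{N(v)}(v') = \phi(v)$ is modeled as "$v'$ produces the same output as $v$ and has halted by time $N(v)$", which relies on the uniqueness of the halt ID for a given output argued above. Work is returned in units of $kT$.

```python
import math

# Hypothetical toy "machine": prefix-free program -> (output, halting time N).
PROGRAMS = {
    "0": ("a", 5),
    "10": ("a", 2),
    "110": ("b", 1),
    "1110": ("a", 1),
}

def first_approach_bits(v, programs=PROGRAMS):
    """Bracketed term of Eq. (28): l(v) + log2 of the summed weights 2^{-l(v')}
    over all programs v' that reach the same halt ID within N(v) iterations."""
    out, n_halt = programs[v]
    mass = sum(2.0 ** -len(u) for u, (o, n) in programs.items()
               if o == out and n <= n_halt)
    return len(v) + math.log2(mass)

def min_work_first_approach(sigma, programs=PROGRAMS):
    """Return (best starting program, minimal work in units of kT) for output sigma."""
    candidates = [v for v, (o, _) in programs.items() if o == sigma]
    best = min(candidates, key=lambda v: first_approach_bits(v, programs))
    return best, math.log(2) * first_approach_bits(best, programs)

best, work = min_work_first_approach("a")
```

In this particular toy the minimum is attained by the fastest-halting program, because no other program has reached the halt ID by the time it halts, so no information about the initial ID has been lost at that point.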

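In contrast, in the second approach, the analogous analysis
shows that the expected work, in units of $kT$, is

$$
\begin{aligned}
& -\sum_{v \in I^\sigma} P_0(v) \ln[\mathcal{G}_0(v)] + \lim_{\tau \to \infty} \Big\{ \sum_{v \in I^\sigma} P_\tau(\phi(v)) \ln[\mathcal{G}_\tau(\phi(v))] \Big\} \\
&\quad = -\sum_{v \in I^\sigma} P_0(v) \ln[\mathcal{G}_0(v)] + \lim_{\tau \to \infty} \Big\{ \sum_{v \in I^\sigma} P_0(v) \ln\Big[ \sum_{v' : f^{\tau}(v') = \phi(v)} \mathcal{G}_0(v') \Big] \Big\} \\
&\quad = \sum_{v \in I^\sigma} P_0(v) \ln\left[ \frac{\lim_{\tau \to \infty} \sum_{v' \in I^\sigma : N(v') \le \tau} \mathcal{G}_0(v')}{\mathcal{G}_0(v)} \right] \\
&\quad = \sum_{v \in I^\sigma} P_0(v) \ln\left[ \frac{\mathcal{G}_0(I^\sigma)}{\mathcal{G}_0(v)} \right]
\end{aligned}
\tag{29}
$$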


where the penultimate step uses the fact that $\phi(v)$ is a fixed
point for all $v \in I^\sigma$, and the last step uses the fact that all
$v' \in I^\sigma$ eventually halt.
So defining expected work using this second approach, the
optimal $P_0$ is a delta function about the $v \in I^\sigma$ that maximizes
$\mathcal{G}_0(v) \propto 2^{-\ell(v)}$, since that minimizes the logarithm in Eq. (29).
But that is just the $v_0$ of minimal length in the
set of all $v_0$ that result in output $\sigma$, and that minimal length is by
definition the Kolmogorov complexity $K_U(\sigma)$. The associated minimal
expected work is

$$kT \ln(2) \left[ K_U(\sigma) + \log(\mathcal{G}_0(I^\sigma)) + \log \Omega \right] \tag{30}$$

as claimed.
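
As a numerical sanity check of Eq. (30) against the last line of Eq. (29), here is a minimal sketch on a made-up prefix-free program table. Everything in it is an assumption for illustration: the table `PROGRAMS`, the treatment of the string "1111" as a non-halting program (so that the toy halting probability is below 1), and the use of the shortest listed program length as a stand-in for $K_U(\sigma)$. Work is in units of $kT$.

```python
import math

# Hypothetical halting programs of a toy prefix machine -> output.
# The string "1111" is assumed never to halt, so it carries no weight here.
PROGRAMS = {"0": "a", "10": "a", "110": "b", "1110": "a"}

def second_approach_work(sigma, programs=PROGRAMS):
    """Return (Eq. 30 value, Eq. 29 value at the optimal delta-function P_0)."""
    # Toy halting probability: total weight 2^{-l(v)} of all halting programs.
    omega = sum(2.0 ** -len(v) for v in programs)
    # Prior G_0(v) proportional to 2^{-l(v)}, normalized by omega.
    g0 = {v: 2.0 ** -len(v) / omega for v in programs}
    i_sigma = [v for v, out in programs.items() if out == sigma]
    g0_i = sum(g0[v] for v in i_sigma)         # G_0(I^sigma)
    k_sigma = min(len(v) for v in i_sigma)     # toy stand-in for K_U(sigma)
    eq30 = math.log(2) * (k_sigma + math.log2(g0_i) + math.log2(omega))
    v_star = min(i_sigma, key=len)             # shortest program computing sigma
    eq29 = math.log(g0_i / g0[v_star])
    return eq30, eq29

eq30, eq29 = second_approach_work("a")
```

The two returned values agree up to floating-point error, because $-\ln \mathcal{G}_0(v) = \ln(2)\,[\ell(v) + \log \Omega]$ when $\mathcal{G}_0(v) = 2^{-\ell(v)}/\Omega$.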


